Applying Bayesian Statistics and MCMC to Diabetes Risk Prediction
Data and goal
Diabetes is a chronic health condition that affects millions of people worldwide. If not properly managed, it can lead to severe health complications such as heart disease, kidney failure, and blindness. Early detection of diabetes is crucial as it allows for timely intervention and management, which can significantly improve the quality of life for individuals and reduce the overall burden on healthcare systems.
The chosen dataset contains medical and demographic data of patients along with their diabetes status, whether positive or negative. It consists of various features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level.
The primary objective of this project is to develop a robust predictive model capable of accurately determining whether a patient has diabetes based on their medical and demographic data.
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is essential to gain insight into the data I am dealing with, to understand what kind of features are present, and to check the consistency of the dataset (missing values, duplicates, imbalances).
Check for missing values
The first step is to check for missing values. I wrote a simple script to check whether the dataset contains NAs. Luckily, as shown in the table below, there are no missing values.
| missing_values | |
|---|---|
| gender | 0 |
| age | 0 |
| hypertension | 0 |
| heart_disease | 0 |
| smoking_history | 0 |
| bmi | 0 |
| HbA1c_level | 0 |
| blood_glucose_level | 0 |
| diabetes | 0 |
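The check itself is a one-liner in base R; here is a minimal sketch on a toy data frame (the real `data` has far more rows and columns):

```r
# Toy stand-in for the report's `data` (hypothetical values)
data <- data.frame(
  age      = c(42, 0.16, 80, 33),
  bmi      = c(27.3, 18.5, NA, 24.1),
  diabetes = c(0, 0, 1, 0)
)

# Count missing values per column, as in the table above
missing_values <- sapply(data, function(x) sum(is.na(x)))
missing_values
```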
Check unique values
Extracting unique values can be useful to understand the nature of the features. In fact, the table below shows three binary variables (diabetes, hypertension and heart_disease) and two categorical variables (gender and smoking_history).
| unique_values | |
|---|---|
| gender | 2 |
| age | 94 |
| hypertension | 2 |
| heart_disease | 2 |
| smoking_history | 6 |
| bmi | 1100 |
| HbA1c_level | 18 |
| blood_glucose_level | 18 |
| diabetes | 2 |
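The same pattern yields the unique-value counts; again sketched on a toy stand-in for `data`:

```r
# Toy data frame standing in for the dataset; the real one has many more rows
data <- data.frame(
  gender   = c("Male", "Female", "Female", "Male"),
  diabetes = c(0, 1, 0, 0)
)

# Number of distinct values per column, as in the table above
unique_values <- sapply(data, function(x) length(unique(x)))
unique_values
```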
Check for imbalanced data
Since mine is a binary classification problem where I have to predict the diabetes status of a patient, measuring class imbalance is a key step, since an imbalanced dataset can negatively affect the model. The plot below makes the difference between the two classes clear: class 0 (no diabetes) is the majority class, while class 1 (diabetes) is the minority class, with a ratio of about 9:1.
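The class ratio can be measured directly with `table()`; a sketch using toy labels with the same 9:1 imbalance:

```r
# Toy labels mimicking the dataset's 9:1 imbalance
diabetes <- c(rep(0, 90), rep(1, 10))

counts <- table(diabetes)
prop.table(counts)  # class proportions: 0.9 for class 0, 0.1 for class 1
```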
Check for duplicates
Also looking for duplicates is a key step when cleaning the dataset. As it turns out, this dataset contains a few duplicates, which I will simply drop.
## [1] 9
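A base-R sketch of the duplicate check and removal, on a toy data frame with one duplicated row:

```r
# Toy data frame with one duplicated row
data <- data.frame(age = c(42, 42, 80), bmi = c(27.3, 27.3, 30.1))

sum(duplicated(data))              # number of duplicate rows found
data <- data[!duplicated(data), ]  # drop them, keeping the first occurrence
nrow(data)
```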
Descriptive statistics
In the end, another useful piece of information is contained in the summary of my numeric variables (age, bmi, HbA1c and glucose level). Notice that the minimum value for 'age' is 0.16, i.e. an infant's age expressed as a fraction of a year. These statistics make the presence of outliers obvious, especially in blood_glucose_level and bmi.
## age bmi HbA1c_level blood_glucose_level
## Min. : 0.16 Min. :11.94 Min. :3.500 Min. : 80.0
## 1st Qu.:24.00 1st Qu.:23.74 1st Qu.:4.800 1st Qu.:100.0
## Median :44.00 Median :27.32 Median :5.800 Median :140.0
## Mean :42.47 Mean :27.54 Mean :5.534 Mean :138.1
## 3rd Qu.:60.00 3rd Qu.:29.95 3rd Qu.:6.200 3rd Qu.:159.0
## Max. :80.00 Max. :88.72 Max. :9.000 Max. :300.0
Data Visualization
As written above, this dataset contains features that are more or less related to the disease, and I will now dig into them to get a better view.
Numeric variables
First I'll look at how the numeric variables are distributed. I notice that there are some over-represented values in the 'age' (80) and 'bmi' (27.5) fields.
It can also be interesting to plot the box-plots of these numeric variables. For example, looking at age, it is obvious how this disease tends to affect people more as they grow older. For BMI, we can see how it tends to be higher for people with diabetes. Blood glucose level and HbA1c level also have plenty of outliers, and their interquartile ranges are higher for the diabetic group.
The presence of outliers is obvious in features like BMI, HbA1c level and blood glucose level. These outliers are crucial in a healthcare context because they may help us detect the disease.
Categorical features
So it looks like the percentage of males with diabetes is slightly higher than for females.
Binary variables
As shown above, the percentage of people with diabetes is way higher among people with either Hypertension or Heart Disease.
Meaning of the features
After some research, I can highlight the following relationships between the features shown above and the disease:
Age:
As we age we face a gradual decline in the ability to produce insulin, a
hormone crucial for regulating blood sugar levels. This decline in
insulin production, coupled with other age-related factors,
significantly increases the risk of developing diabetes, particularly
type 2 diabetes.
Body Mass Index (BMI):
Excess body weight, often measured by Body Mass Index (BMI), is a
prominent risk factor for type 2 diabetes. When we carry extra weight
our cells become less responsive to insulin, making it harder for the
body to utilize glucose effectively. A BMI below 25 is considered normal weight, between 25 and 30 overweight, and above 30 obese.
HbA1c:
HbA1c, also known as glycated hemoglobin, is a test that provides an
average measure of blood sugar control over the past 2-3 months.
Elevated HbA1c levels serve as a strong indicator of diabetes,
reflecting consistently high blood sugar levels over time. A normal value is below about 5.7% (6.5% or above indicates diabetes), but also in this case there are a lot of outliers.
Blood Glucose Level:
Blood glucose level is a direct measure of the amount of sugar present
in the blood at a specific point in time. While not a definitive
diagnosis, consistently high blood sugar readings can raise concerns
about diabetes. The ideal fasting range is between 70 mg/dL (3.9 mmol/L) and 100 mg/dL (5.6 mmol/L).
Hypertension:
Hypertension, or high blood pressure, not only puts a strain on the
heart and blood vessels but can also indirectly increase the risk of
diabetes. Chronic high blood pressure can damage blood vessels and
impair insulin production, making it harder for the body to manage blood
sugar levels effectively.
Gender:
While gestational diabetes, a type of diabetes that develops during
pregnancy, is more common in women, men and women have similar risk
factors for type 2 diabetes. However, certain factors, such as
polycystic ovary syndrome (PCOS), can increase a woman’s risk of
developing type 2 diabetes.
Smoking:
Smoking is a well-established risk factor for various health problems,
including diabetes. The harmful chemicals in cigarettes can damage the
cells that produce insulin, leading to insulin resistance and an
increased risk of developing diabetes. However, it is not a key factor for diabetes.
The specific impact of each factor may vary depending on an individual’s overall health, lifestyle, and family history. Maintaining a healthy weight, engaging in regular physical activity, and following a balanced diet are crucial steps in reducing the risk of developing diabetes.
Correlation matrix
Apparently, there are no very strong correlations among the numeric variables. The strongest one is between 'bmi' and 'age', which may indicate that as people grow older they worry less about controlling their weight, which may be one of the causes of this disease. The second strongest correlation is between blood glucose level and HbA1c level, since both are metrics related to the amount of sugar in the blood.
Data pre-processing
Label Encoding
After looking at the data I noticed that there are two categorical text columns, 'gender' and 'smoking_history', which I will transform to integers through label encoding:
# Transform categorical variables to integers
data$gender <- as.integer(factor(data$gender))
data$smoking_history <- as.integer(factor(data$smoking_history))
Class Balancing
I will handle class balancing later because I want to make a comparison between how the model behaves with both balanced and unbalanced classes.
Model - Logistic Regression
My model needs to predict whether or not a patient has diabetes based on their condition; therefore, the problem can be formalized as follows:
\[ Y = \left\{ \begin{array}{cl} 0 & \text{No diabetes} \\ 1 & \text{Diabetes} \end{array} \right. \]
I will first try to perform a classic logistic regression without handling class imbalance, and then I will make the comparison with the performance over the balanced set. In the next section I will perform the same analysis but in a Bayesian perspective.
Metrics used
In a diagnostic model we should not focus on accuracy only, but also on specificity and sensitivity. Accuracy assumes that all mistakes have the same weight, but this is not true in a healthcare context. Therefore, these are the chosen metrics:
Accuracy: tells us how often the model is right, regardless of the nature of the errors.
Specificity: in a diagnostic model, represents the percentage of people who test negative among those who do not have the disease, i.e. how good the model is at recognizing who does not have the disease.
Sensitivity: in a diagnostic model, represents the percentage of people who test positive among those who actually have the disease, i.e. how good the model is at correctly identifying those with the disease.
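These three metrics can be computed from the confusion-matrix counts; here is a small helper function (my own sketch, not code from the report), treating class 1 (diabetes) as the positive class:

```r
# Accuracy, specificity and sensitivity from 0/1 label vectors,
# treating 1 (diabetes) as the positive class
classification_metrics <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)
  tn <- sum(predicted == 0 & actual == 0)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  c(accuracy    = (tp + tn) / length(actual),
    specificity = tn / (tn + fp),   # true negative rate
    sensitivity = tp / (tp + fn))   # true positive rate (recall)
}

# Hypothetical example: 10 patients
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 1)
classification_metrics(actual, predicted)
```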
Unbalanced set
I will first make predictions by fitting a simple Logistic Regression model without balancing the classes, to see how the model behaves under this condition.
##
## Call:
## glm(formula = diabetes ~ age + gender + bmi + hypertension +
## heart_disease + smoking_history + HbA1c_level + blood_glucose_level,
## family = binomial(link = "logit"), data = X_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -30.941525 2.610315 -11.854 < 2e-16 ***
## age 0.062875 0.009609 6.544 6.01e-11 ***
## gender 0.040127 0.276622 0.145 0.8847
## bmi 0.159866 0.023145 6.907 4.95e-12 ***
## hypertension -0.118768 0.417529 -0.284 0.7761
## heart_disease 1.141435 0.452685 2.521 0.0117 *
## smoking_history -0.133526 0.098331 -1.358 0.1745
## HbA1c_level 2.476897 0.299530 8.269 < 2e-16 ***
## blood_glucose_level 0.035687 0.003809 9.369 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1049.03 on 1599 degrees of freedom
## Residual deviance: 375.38 on 1591 degrees of freedom
## AIC: 393.38
##
## Number of Fisher Scoring iterations: 8
## Accuracy: 0.96
## Specificity: 0.9862
## Sensitivity: 0.7105
Looking at these results, it is clear that the model reaches a high level of accuracy, but this is driven by the class imbalance. As a matter of fact, despite a really high specificity, the sensitivity is low: the model fails to correctly classify a significant portion of the minority class (1, diabetes), while correctly classifying the majority class (about 90% of the set).
Balanced set
First I will balance the classes. To do so, I took advantage of the ROSE package (Random Over-Sampling Examples), which allows me to get a new dataset with balanced classes:
##
## 0 1
## 1010 990
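ROSE generates synthetic examples; as a rough base-R illustration of what class balancing achieves, here is plain random oversampling of the minority class (a simpler alternative, shown on toy data):

```r
set.seed(1)
# Toy unbalanced set: 90 negatives, 10 positives (mimicking the 9:1 ratio)
df <- data.frame(x = rnorm(100), diabetes = c(rep(0, 90), rep(1, 10)))

majority <- df[df$diabetes == 0, ]
minority <- df[df$diabetes == 1, ]

# Resample the minority class with replacement up to the majority size
minority_up <- minority[sample(nrow(minority), nrow(majority), replace = TRUE), ]
balanced <- rbind(majority, minority_up)
table(balanced$diabetes)  # both classes now have 90 rows
```

Unlike this sketch, ROSE draws new synthetic observations from a smoothed density around the existing ones, rather than repeating rows verbatim.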
This is the new model trained on a dataset with balanced classes:
##
## Call:
## glm(formula = diabetes ~ age + gender + bmi + hypertension +
## heart_disease + smoking_history + HbA1c_level + blood_glucose_level,
## family = binomial(link = "logit"), data = X_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -17.270549 0.988200 -17.477 < 2e-16 ***
## age 0.052552 0.004494 11.694 < 2e-16 ***
## gender 0.096515 0.148749 0.649 0.51644
## bmi 0.087127 0.011751 7.414 1.22e-13 ***
## hypertension -0.162598 0.245196 -0.663 0.50724
## heart_disease 0.643659 0.247437 2.601 0.00929 **
## smoking_history -0.163530 0.055899 -2.925 0.00344 **
## HbA1c_level 1.409715 0.100762 13.991 < 2e-16 ***
## blood_glucose_level 0.023285 0.001885 12.352 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2218.0 on 1599 degrees of freedom
## Residual deviance: 937.1 on 1591 degrees of freedom
## AIC: 955.1
##
## Number of Fisher Scoring iterations: 6
## Accuracy: 0.875
## Specificity: 0.8627
## Sensitivity: 0.8878
Comparison of the models
P-value
The p-values indicate the probability of observing a value as extreme as the z-value under the null hypothesis (that the coefficient equals zero).
- Low p-value (< 0.05): the feature is a significant predictor.
- High p-value (> 0.05): the feature may not be a significant predictor.
The intercept is lower in the unbalanced model, due to the class imbalance, suggesting a lower baseline risk of diabetes. Age, BMI, HbA1c level, and blood glucose level are significant predictors in both models. Gender and hypertension are not significant in either model. Heart disease is significant in both models, while smoking history is significant only in the balanced model.
Deviance and AIC
The unbalanced model has a lower residual deviance and AIC, but since the two models are fitted to different datasets (the balanced one contains resampled observations), these measures are not directly comparable. The balanced model is nonetheless likely to generalize better for our diagnostic goal, capturing the effect of the predictors on the minority class more reliably. Therefore, balancing the dataset improved the usefulness of the model.
Metrics
Looking at the metrics obtained on the balanced dataset, accuracy and specificity decreased a little. On the other hand, sensitivity definitely increased, which is a huge improvement and suggests that the balanced-dataset model is better at correctly identifying the minority class (1, diabetes), which is our actual goal.
Model - Bayesian Logistic Regression
Since my goal is to classify a patient with two possible outcomes (diabetes or no diabetes), the best choice for modelling the data is the Bernoulli distribution. If \(\pi\) is the probability that a patient has diabetes, I can define the distribution of the outcomes \(Y\) as:
\[ Y_i|\pi_i \sim Ber(\pi_i) \]
All the other features are my predictors. Therefore, since I am performing a Logistic regression, I can put forth the following interpretation:
If \(Y\) is the binary indicator that an event occurs with probability \(\pi\), I can write the logistic regression model of \(Y\) with predictors \((X_1,...,X_8)\):
\[ log( \frac{\pi}{1-\pi}) = \beta_0 + \beta_1 X_1 + ... + \beta_8 X_8 \] Where \(log (\frac{\pi}{1-\pi})\) are the log-odds that a patient has the disease.
The next step to complete my Bayesian logistic regression model for \(Y\) is to define the priors for the regression parameters \(\beta_i\), with \(i=0,\dots,8\). Since I have no prior information, I chose non-informative, also called weakly informative, priors. Such priors are really useful when we want them to have minimal influence on the inference:
\[ \begin{array}{cl} &\beta_0 \sim N(0,1.67) \\ &\beta_i\sim N(0,10), \quad i = 1,...,8 \qquad i \ne 7 \\ &\beta_7 \sim N(0,1.67) \\ \end{array} \]
Setting \(\mu = 0\) means that there is no prior inclination towards positive or negative values for the coefficients, while the variance is initially set to a large value because it reflects a high level of uncertainty and is therefore non-informative. I first set the \(\beta_0\) variance to 100 and the others to 10, and then tuned them based on the convergence metrics, arriving at the values shown above.
To carry out the Bayesian analysis with Markov Chain Monte Carlo (MCMC) methods I took advantage of JAGS (Just Another Gibbs Sampler), a powerful program that allows fitting complex models with a straightforward and flexible syntax, much like the BUGS (Bayesian inference Using Gibbs Sampling) language.
Notice that JAGS parameterizes the normal distribution with precision rather than variance, so I need to take the inverse of the variance when fitting the model:
\[
\begin{array}{cl}
&\beta_0 \sim N(0,\frac{1}{1.67})= N(0,0.6)\\
&\beta_i \sim N(0,\frac{1}{10}) = N(0,0.1), \quad i = 1,...,8 \qquad
i \ne 7 \\
&\beta_7 \sim N(0,\frac{1}{1.67})= N(0,0.6)\\
\end{array}
\]
Unbalanced set
I used the following JAGS model to make predictions over the unbalanced set:
"
model {
beta0 ~ dnorm(0, 0.6)
beta_gender ~ dnorm(0, 0.1)
beta_age ~ dnorm(0, 0.1)
beta_hypertension ~ dnorm(0, 0.1)
beta_heart_disease ~ dnorm(0, 0.1)
beta_smoking_history ~ dnorm(0, 0.1)
beta_bmi ~ dnorm(0, 0.1)
beta_HbA1c ~ dnorm(0, 0.6)
beta_blood ~ dnorm(0, 0.1)
for (i in 1:N) {
logit_p[i] <- beta0 +
beta_gender * x[i,1] +
beta_age * x[i,2] +
beta_hypertension * x[i,3] +
beta_heart_disease * x[i,4] +
beta_smoking_history * x[i,5] +
beta_bmi * x[i,6] +
beta_HbA1c * x[i,7] +
beta_blood * x[i,8]
p[i] <- 1 / (1 + exp(-logit_p[i]))
y[i] ~ dbern(p[i])
}
}
"
And this is the code I used to run the JAGS model.
j_nobal <- jags(model.file='jags_nobal.txt',
data = data,
inits = NULL,
n.chains = 3,
n.iter = 4000,
n.burnin = 1000,
parameters.to.save = c("beta0","beta_gender",
"beta_age","beta_hypertension",
"beta_heart_disease","beta_smoking_history",
"beta_bmi","beta_HbA1c", "beta_blood"))
## Inference for Bugs model at "jags_nobal.txt", fit using jags,
## 3 chains, each with 4000 iterations (first 1000 discarded), n.thin = 3
## n.sims = 3000 iterations saved
## mu.vect sd.vect 2.5% 25% 50% 75% 97.5%
## beta0 -13.291 0.848 -14.925 -13.886 -13.272 -12.718 -11.638
## beta_HbA1c 0.942 0.103 0.735 0.874 0.944 1.011 1.141
## beta_age 0.032 0.006 0.021 0.028 0.032 0.036 0.044
## beta_blood 0.023 0.002 0.018 0.021 0.023 0.024 0.028
## beta_bmi 0.062 0.016 0.032 0.051 0.061 0.072 0.092
## beta_gender -0.359 0.228 -0.805 -0.516 -0.354 -0.202 0.068
## beta_heart_disease 0.891 0.360 0.192 0.649 0.891 1.139 1.585
## beta_hypertension 0.113 0.306 -0.502 -0.096 0.123 0.319 0.688
## beta_smoking_history -0.270 0.071 -0.413 -0.317 -0.270 -0.221 -0.137
## deviance 474.943 13.039 451.879 465.339 474.514 483.060 502.772
## Rhat n.eff
## beta0 1.008 280
## beta_HbA1c 1.057 40
## beta_age 1.006 1100
## beta_blood 1.012 170
## beta_bmi 1.018 470
## beta_gender 1.004 1300
## beta_heart_disease 1.002 1600
## beta_hypertension 1.001 3000
## beta_smoking_history 1.003 1200
## deviance 1.006 390
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
##
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 84.6 and DIC = 559.6
## DIC is an estimate of expected predictive error (lower deviance is better).
## Accuracy: 0.9475
## Specificity: 0.9972
## Sensitivity: 0.4737
Looking at these results, they are really similar to those obtained with the classic logistic regression model on the unbalanced dataset: the model reaches a really high level of accuracy but a very low sensitivity.
Balanced set
The model is specified using the same JAGS syntax, here with slightly different prior precisions tuned for convergence, which allows us to integrate prior knowledge with observed data to estimate the parameters of interest:
"
model {
beta0 ~ dnorm(0, 0.6)
beta_gender ~ dnorm(0, 0.1)
beta_age ~ dnorm(0, 0.2)
beta_hypertension ~ dnorm(0, 0.1)
beta_heart_disease ~ dnorm(0, 0.1)
beta_smoking_history ~ dnorm(0, 0.1)
beta_bmi ~ dnorm(0, 0.2)
beta_HbA1c ~ dnorm(0,0.4)
beta_blood ~ dnorm(0, 0.1)
for (i in 1:N) {
logit_p[i] <- beta0 +
beta_gender * x[i,1] +
beta_age * x[i,2] +
beta_hypertension * x[i,3] +
beta_heart_disease * x[i,4] +
beta_smoking_history * x[i,5] +
beta_bmi * x[i,6] +
beta_HbA1c * x[i,7] +
beta_blood * x[i,8]
p[i] <- 1 / (1 + exp(-logit_p[i]))
y[i] ~ dbern(p[i])
}
}
"
And this is the code I used to run the JAGS model:
j_bal <- jags(model.file='jags_bal.txt',
data = data,
inits = NULL,
n.chains = 3,
n.iter = 4000,
n.burnin = 1000,
parameters.to.save = c("beta0","beta_gender","beta_age",
"beta_hypertension","beta_heart_disease",
"beta_smoking_history","beta_bmi",
"beta_HbA1c", "beta_blood"))
## Inference for Bugs model at "jags_bal.txt", fit using jags,
## 3 chains, each with 4000 iterations (first 1000 discarded), n.thin = 3
## n.sims = 3000 iterations saved
## mu.vect sd.vect 2.5% 25% 50% 75% 97.5%
## beta0 -12.216 0.603 -13.429 -12.615 -12.197 -11.788 -11.106
## beta_HbA1c 1.022 0.070 0.887 0.975 1.022 1.071 1.162
## beta_age 0.043 0.004 0.035 0.040 0.043 0.046 0.051
## beta_blood 0.019 0.002 0.016 0.017 0.018 0.020 0.022
## beta_bmi 0.057 0.010 0.036 0.050 0.057 0.064 0.076
## beta_gender -0.076 0.135 -0.341 -0.169 -0.074 0.016 0.181
## beta_heart_disease 0.531 0.225 0.107 0.373 0.531 0.688 0.972
## beta_hypertension -0.077 0.218 -0.493 -0.227 -0.077 0.072 0.342
## beta_smoking_history -0.220 0.049 -0.309 -0.254 -0.220 -0.186 -0.123
## deviance 978.549 9.434 961.380 971.984 978.189 984.859 998.121
## Rhat n.eff
## beta0 1.002 3000
## beta_HbA1c 1.006 370
## beta_age 1.004 620
## beta_blood 1.004 580
## beta_bmi 1.008 290
## beta_gender 1.005 470
## beta_heart_disease 1.001 2900
## beta_hypertension 1.001 3000
## beta_smoking_history 1.001 3000
## deviance 1.001 3000
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
##
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 44.5 and DIC = 1023.1
## DIC is an estimate of expected predictive error (lower deviance is better).
## Accuracy: 0.8725
## Specificity: 0.8627
## Sensitivity: 0.8827
Also in this case, balancing the classes gives better results in a diagnostic context. As a matter of fact, the sensitivity definitely increased while both accuracy and specificity decreased a little. It is clear that class balancing affects both models in the same way.
Comparison of the models
DIC
The Deviance Information Criterion (DIC) is a measure used primarily in Bayesian model selection and model comparison by balancing model fit (goodness of fit) with model complexity, aiming to select models that fit the data well while penalizing overly complex models.
Looking at the results obtained from the two models, based on the DIC metric the first one appears far better. Note, however, that the two models are fitted to different datasets (the balanced set contains resampled observations), so their DIC values are not directly comparable.
## Imbalanced dataset: 559.5663
## Balanced dataset: 1023.07
MCMC Diagnostics
R-hat
It is a measure used to assess convergence of multiple Markov Chain Monte Carlo (MCMC) chains in Bayesian analysis. R-hat is calculated using the variance of all samples pooled from different chains and the average variance of each individual chain: \[ \hat{R} = \sqrt{\frac{\text{Var}(\theta | y)}{W}} \]
- \(\text{Var}(\theta | y)\): average posterior variance of the parameter across all chains.
- \(W\): average of the within-chain variances.
We can give the following interpretation of \(\hat{R}\):
- \(\hat{R} < 1.05\): chains have likely converged.
- \(\hat{R} > 1.05\): indicates a potential lack of convergence.
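The computation can be sketched in a few lines of base R (using toy chains, not the report's actual MCMC draws):

```r
# Gelman–Rubin R-hat for m chains of equal length n (a list of numeric vectors)
rhat <- function(chains) {
  n <- length(chains[[1]])
  chain_means <- sapply(chains, mean)
  W <- mean(sapply(chains, var))       # average within-chain variance
  B <- n * var(chain_means)            # between-chain variance
  var_hat <- (n - 1) / n * W + B / n   # pooled estimate of Var(theta | y)
  sqrt(var_hat / W)
}

set.seed(42)
# Three well-mixed chains from the same distribution: R-hat should be near 1
rhat(list(rnorm(1000), rnorm(1000), rnorm(1000)))
# Chains stuck around different values: R-hat far above 1.05
rhat(list(rnorm(1000, 0), rnorm(1000, 5), rnorm(1000, -5)))
```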
Looking at the \(\hat{R}\) values obtained from the two models, all the parameters are below the threshold, so it is likely that all of them converged.
Effective Sample Size
Effective Sample Size (n.eff) is crucial for assessing the convergence of Bayesian models because it measures the quality of MCMC (Markov Chain Monte Carlo) samples. Unlike raw sample size, n.eff accounts for autocorrelation within the chain, indicating how many independent samples the chain effectively provides.
This metric complements other diagnostics like R-hat. While R-hat indicates if the chains have converged to a common distribution, n.eff ensures that the samples are sufficiently independent. Therefore, even with an R-hat close to 1, a low n.eff would signal that the parameter estimates might still be unreliable due to high autocorrelation.
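The estimate can be sketched in base R with the common truncated-autocorrelation formula \(n_{eff} = n / (1 + 2\sum_k \rho_k)\) (again on toy chains, not the report's draws):

```r
# Effective sample size of one chain, truncating the autocorrelation sum
# at the first non-positive autocorrelation
ess <- function(chain, max_lag = 100) {
  n <- length(chain)
  rho <- acf(chain, lag.max = max_lag, plot = FALSE)$acf[-1]  # lags 1..max_lag
  cut <- which(rho <= 0)[1]
  if (!is.na(cut)) rho <- rho[seq_len(cut - 1)]
  n / (1 + 2 * sum(rho))
}

set.seed(1)
ess(rnorm(2000))          # independent draws: close to 2000
ess(cumsum(rnorm(2000)))  # highly autocorrelated random walk: far smaller
```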
Looking at the values below, we notice fairly high values for both models, except for beta_HbA1c and beta0, which have been the toughest ones to tune.
Balanced Dataset
## beta_age beta_blood beta_bmi
## 360.76233 357.10956 209.82009
## beta_gender beta_HbA1c beta_heart_disease
## 414.98531 106.47633 2473.43114
## beta_hypertension beta_smoking_history beta0
## 2673.77940 481.76347 58.71129
## deviance
## 96.79914
Unbalanced Dataset
## beta_age beta_blood beta_bmi
## 336.59755 245.46984 182.99146
## beta_gender beta_HbA1c beta_heart_disease
## 412.01325 101.35771 1559.40509
## beta_hypertension beta_smoking_history beta0
## 2289.66549 493.67744 58.19388
## deviance
## 81.50509
Trace plots
Trace plots help us understand whether the MCMC chains have stabilized, hence converged: without convergence the results may be biased or incorrect. Trace plots also make it possible to compare multiple chains to ensure they all converge to the same distribution.
Balanced Dataset
Unbalanced Dataset
Density plot
The density plot shows the distribution of parameter values across all chains after the burn-in iterations have been discarded. It also lets us see where the parameter values are concentrated, indicating the central tendency and spread of the estimates.
Plotting density plots alongside trace plots allows us to check that the individual chains have converged to similar distributions, so that the combined distribution accurately represents the posterior distribution.
Balanced Dataset
Unbalanced Dataset
Autocorrelation
Autocorrelation in MCMC refers to how much each sample in the chain is related to its immediate predecessor, which influences how the chain explores the parameter space. High autocorrelation implies that each new sample is close to the previous one (slower exploration, slower convergence); low autocorrelation means that successive samples are less dependent on each other (efficient exploration, faster convergence).
As a matter of fact, the plots below show that the autocorrelation of all parameters drops to zero very quickly, except for 'beta_HbA1c' and the intercept 'beta0': their decay is slower, but eventually they reach zero as well.
Balanced Dataset
Unbalanced Dataset
Posterior predictions
Posterior prediction is used in Bayesian statistics to describe the process of making predictions for new data or future observations based on posterior distributions obtained from Bayesian inference.
In Bayesian statistics, the posterior distribution represents the updated belief about the parameters of interest after taking into account the observed data and prior beliefs. Once this posterior distribution is obtained, it can be used to generate predictions for new data points.
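The mechanics can be sketched with simulated coefficient draws (hypothetical numbers for a toy two-parameter model; in the report the draws come from the JAGS fits):

```r
set.seed(7)
# Hypothetical posterior draws for a toy logistic model: intercept + one predictor
beta0 <- rnorm(3000, mean = -1.0, sd = 0.10)
beta1 <- rnorm(3000, mean =  0.5, sd = 0.05)

x_new <- 2  # predictor value of a hypothetical new patient

# One predicted probability per posterior draw (inverse-logit via plogis)
p_draws <- plogis(beta0 + beta1 * x_new)

mean(p_draws)        # posterior mean of the predicted probability
mean(p_draws < 0.5)  # share of draws falling below the 0.5 threshold
```

The quantities reported below for the two datasets are exactly this last kind of summary: the posterior predictive probability of a prediction below 0.5.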
Balanced Dataset
After balancing the classes, the posterior predictive probability of observing a value below 0.5 is around 0.5:
## [1] 0.5275
Unbalanced Dataset
Without balancing the classes, the posterior predictive probability of observing a value below 0.5 is more than 0.9:
## [1] 0.9525
Conclusion
In this project I implemented a classic logistic regression and a Bayesian logistic regression, and each model has been fitted on both the unbalanced and the balanced dataset. The table below summarizes the results:
| Model | Accuracy | Sensitivity | Specificity |
|---|---|---|---|
| Bayesian Logistic Regression - Unbalanced | 0.9475 | 0.4737 | 0.9972 |
| Bayesian Logistic Regression - Balanced | 0.8725 | 0.8827 | 0.8627 |
| Logistic Regression - Unbalanced | 0.96 | 0.7105 | 0.9862 |
| Logistic Regression - Balanced | 0.875 | 0.8878 | 0.8627 |
Based on R-hat, autocorrelation, and ESS, both Bayesian models converged. The Bayesian model fitted on the balanced dataset tends to have narrower credible intervals (smaller standard deviations) than the one fitted on the unbalanced dataset, indicating more certainty in the parameter estimates. It does show a higher deviance and a higher DIC, but since the two models are fitted to different datasets, these fit measures are not directly comparable.
Nevertheless, notice that this is a diagnostic model, so if we take the metrics (accuracy, specificity and sensitivity) into account, the model fitted on the balanced dataset is more reliable for our needs.